Author

Jasmine Cabrera

1 The Data

Our dataset(s) in this lab concerns baby names and their popularity over time. At this link, you can find the names for ALL 50 states, in separate datasets organized by first letter. For each year, and for each name with at least 50 recorded babies born, we are given the counts of how many babies had that name.

2 Is My Name Not Cool Anymore?

Let’s take a look at how the name “Allison” has changed over time. As my name begins with “A”, you should download the StateNames_A.csv dataset from the link above.

Code
library(tidyverse)
library(here)
library(DT)
library(gt)
library(kableExtra)

# Imports data
names <- read_csv(here("StateNames_A.csv"))

# Interactive Data Table
DT::datatable(names)

3 Summarizing & Visualizing the Number of Allisons

  1. Make a summary table of the number of babies named “Allison” for each state and the sex of the baby. Specifically, the table should have the following qualities:
  • each state should be its own row

  • and each sex should have its own column

  • if there were no babies born for that combination of state & sex there should be a 0 (not an NA)

The dataset has a column titled Gender, which contains two values "F" and "M", representing “Female” and “Male”. The sex someone was assigned at birth is different from their gender identity (definitions). Thus, this variable should be renamed to Sex or Sex at Birth.

  1. You should have seen in the table above that “Allison” is a name given overwhelmingly to babies assigned “female” at birth. So, create a new data set named allison_f which contains only the babies assigned Female at birth.

This should be the data set you use for the remainder of the assignment.

  1. Make a visualization showing how the popularity of the name “Allison” has changed over the years. To be clear, each year should have one observation–the total number of Allisons born that year.

    Code
    allison <- names |>
      rename(Sex = Gender) |>
      filter(Name == "Allison") |>
      group_by(State, Sex) |>
      summarize(Total = sum(Count), 
                .groups = "drop") |>
      pivot_wider(names_from = Sex,
                  values_from = Total,
                  values_fill = 0) |>
      print()
    # A tibble: 51 × 3
       State     F     M
       <chr> <dbl> <dbl>
     1 AK      232     0
     2 AL     1535     0
     3 AR     1198     0
     4 AZ     1880     0
     5 CA    12413     0
     6 CO     1594     0
     7 CT     1099     0
     8 DC      321     0
     9 DE      294     0
    10 FL     4455     0
    # … with 41 more rows
    Code
    allison |>
      knitr::kable(format = "html",
                   col.names = c("State",
                                 "Female",
                                 "Male"),
                   caption = "Frequency of Allisons in Each State") |>
      kableExtra::kable_classic(html_font = "Arial") |>
      kableExtra::kable_styling(bootstrap_options = "striped")
    Frequency of Allisons in Each State
    State Female Male
    AK 232 0
    AL 1535 0
    AR 1198 0
    AZ 1880 0
    CA 12413 0
    CO 1594 0
    CT 1099 0
    DC 321 0
    DE 294 0
    FL 4455 0
    GA 3257 0
    HI 183 0
    IA 1477 0
    ID 451 0
    IL 5110 0
    IN 3067 0
    KS 1283 0
    KY 1905 20
    LA 1209 0
    MA 2218 0
    MD 2229 0
    ME 340 0
    MI 4014 0
    MN 2374 0
    MO 2882 0
    MS 817 0
    MT 226 0
    NC 3435 0
    ND 285 0
    NE 807 0
    NH 412 0
    NJ 3052 0
    NM 399 0
    NV 729 0
    NY 5747 0
    OH 5487 0
    OK 1421 0
    OR 1186 0
    PA 4307 0
    RI 306 0
    SC 1228 0
    SD 376 0
    TN 2488 0
    TX 10192 0
    UT 1125 0
    VA 3220 0
    VT 135 0
    WA 1956 0
    WI 2367 0
    WV 813 0
    WY 142 0
    Code
    allison_f <- names |>
      rename(Sex = Gender) |>
      filter(Name == "Allison") 
    
    allison_f|>
      group_by(Year) |>
      summarize(Total = sum(Count)) |>
      ggplot(mapping = aes(x = Year,
                           y = Total)) +
      geom_line() +
      labs(y = NULL,
           title = "Total Number of Allisons Born Between 1997-2014") +
      scale_x_continuous(breaks = seq(1997, 2014, by = 1), 
                         guide =  guide_axis(n.dodge = 2)) +
      scale_y_continuous(breaks = seq(5000, 9000, by = 1000), 
                         limits = c(5000, 8000)) + 
      theme(plot.title = element_text(face = "bold",
                                      hjust = 0.5))

    Code
    allison_f
    # A tibble: 914 × 5
       Name     Year Sex   State Count
       <chr>   <dbl> <chr> <chr> <dbl>
     1 Allison  1997 F     AK       19
     2 Allison  1998 F     AK       10
     3 Allison  1999 F     AK       13
     4 Allison  2000 F     AK       14
     5 Allison  2001 F     AK        9
     6 Allison  2002 F     AK       14
     7 Allison  2003 F     AK       14
     8 Allison  2004 F     AK       13
     9 Allison  2005 F     AK       10
    10 Allison  2006 F     AK       18
    # … with 904 more rows

Source:

https://stackoverflow.com/questions/55948425/why-is-scale-y-continuous-not-working-here-no-error-message-given

https://www.datanovia.com/en/blog/ggplot-title-subtitle-and-caption/#:~:text=Change%20the%20font%20appearance%20%28text%20size%2C%20color%20and,font%20face%20include%3A%20%E2%80%9Cplain%E2%80%9D%2C%20%E2%80%9Citalic%E2%80%9D%2C%20%E2%80%9Cbold%E2%80%9D%20and%20%E2%80%9Cbold.italic%E2%80%9D.

4 Modeling the Number of Allisons

  1. Fit a linear model with the year as the explanatory variable, and the number of Allisons as the response. Similar to #3, each year should have one observation–the total number of Allisons born that year.

  2. Write out the estimated regression equation.

    Estimated Total Number of Allisons = 209689.8 -101.5(Year)

    ***Note: Year must be substituted with the actual Year (i.e. 1997, 2014, etc.)

  3. Plot the residuals of the model, that is, the actual values minus the predicted values. Comment on the residuals - do you see any patterns?

There is slight fanning in the residuals plot with the residuals forming a concave up parabola.

The augment() function from the broom package may be helpful!

  1. What do you conclude from this model? Is my name not cool anymore?

We can not conclude that the name “Allison” is decreasing in popularity over the years 1997 to 2014 for babies assigned “female” at birth from this model. This is due to the slight curvature in the residuals plot that suggests that we should not use a straight line to model this data.

Code
allison_lm <- allison_f |>
  group_by(Year) |>
  summarize(Total = sum(Count)) |>
  lm(Total ~ Year, data = _) |>
  print()

Call:
lm(formula = Total ~ Year, data = summarize(group_by(allison_f, 
    Year), Total = sum(Count)))

Coefficients:
(Intercept)         Year  
   209815.1       -101.6  
Code
allison_lm |>
  broom::augment() |>
  ggplot(mapping = aes(y = .resid, x = .fitted)) +
  geom_point() + 
  labs(x = "Fitted",
       y = NULL,
       title = "Residual Plot") +
  scale_x_continuous(breaks = seq(5000, 7000, by = 500), 
                     limits = c(5000, 7000)) + 
  scale_y_continuous(breaks = seq(-1000, 1000, by = 500), 
                     limits = c(-1250, 1250)) + 
  theme(plot.title = element_text(face = "bold",
                                  hjust = 0.5))

5 Spelling by State

In middle school I was so upset with my parents for not naming me “Allyson”. Past my pre-teen rebellion, I’m happy with my name and am impressed when baristas spell it “Allison” instead of “Alison”. But I don’t have it as bad as my good friend Allan!

  1. Narrow the A name dataset (downloaded previously) down to only male-assigned babies named “Allan”, “Alan”, or “Allen”. Make a plot comparing the popularity of these names over time.

It looks like you want to filter for a vector of values. What tools have you learned which can help you accomplish this task?

  1. In California, Allan’s spelling of his name is the least common of the three but perhaps it’s not such an unusual name for his home state of Pennsylvania. Compute the total number of babies born with each spelling of “Allan” in the year 2000, in Pennsylvania and in California. Specifically, the table should have the following qualities:
  • each spelling should be its own column

  • each state should have its own row

  • a 0 (not an NA) should be used to represent locations where there were no instances of these names

  1. Convert your total counts to overall percents. That is, what was the percent breakdown between the three spellings in CA? What about in PA?

In the year 2000, 21.364% of all Male birthed babies were named some form of “Allan”, “Alan”, or “Allen” were born in California and 2.869% were born in Pennsylvania. Looking closer, in the California selection: 19.86% were named Allen, 65.34% were named Alan, and 14.785% were named Allan. In the Pennsylvania selection: 47.05% were named Allen, 42.85% were named Alan, and 1.01% were named Allan.

Code
allans <- tibble(Name = c("Allan", 
                          "Alan", 
                          "Allen"))

allan_names <- names |>
  filter(Gender == "M") |>
  semi_join(allans, by = "Name") |>
  rename("Sex" = "Gender") 

allan_names |>
  group_by(Name, Year) |>
  summarize(Total = sum(Count)) |>
  ggplot(mapping = aes(x = Year,
                       y = Total,
                       color = Name)) +
  geom_line() +
  labs(y = NULL,
       title = "Total Number of Allans Born Between 1997-2014") +
  scale_x_continuous(breaks = seq(1997, 2014, by = 1), 
                     guide =  guide_axis(n.dodge = 2)) +
  scale_y_continuous(breaks = seq(0, 3500, by = 1000), 
                     limits = c(0, 3500)) + 
  theme(plot.title = element_text(face = "bold",
                                  hjust = 0.5)) 

Code
allan <- allan_names |>
  pivot_wider(names_from = "Name",
              values_from = "Count",
              values_fill = 0) |>
  filter(Year == 2000) |>
  mutate(Total = Allan + Allen + Alan,
         across(.cols = Allen:Allan, .fns = ~.x/Total,
                .names = "{col}Prop"),
         OverallProp = Total/sum(Total)) |>
  select(Year, Sex, State, Allen, AllenProp, Alan, AlanProp, Allan, 
         AllanProp, Total, OverallProp) |>
  print()
# A tibble: 43 × 11
    Year Sex   State Allen AllenProp  Alan AlanProp Allan AllanP…¹ Total Overa…²
   <dbl> <chr> <chr> <dbl>     <dbl> <dbl>    <dbl> <dbl>    <dbl> <dbl>   <dbl>
 1  2000 M     AL       27     0.529    24    0.471     0   0         51 0.0123 
 2  2000 M     AR       12     0.4      18    0.6       0   0         30 0.00723
 3  2000 M     AZ       22     0.186    85    0.720    11   0.0932   118 0.0285 
 4  2000 M     CA      176     0.199   579    0.653   131   0.148    886 0.214  
 5  2000 M     CO       14     0.2      51    0.729     5   0.0714    70 0.0169 
 6  2000 M     CT       12     0.462    14    0.538     0   0         26 0.00627
 7  2000 M     DC        0     0         6    1         0   0          6 0.00145
 8  2000 M     FL       78     0.375    97    0.466    33   0.159    208 0.0502 
 9  2000 M     GA       42     0.333    79    0.627     5   0.0397   126 0.0304 
10  2000 M     HI        7     0.467     8    0.533     0   0         15 0.00362
# … with 33 more rows, and abbreviated variable names ¹​AllanProp, ²​OverallProp
Code
allan |>
  knitr::kable(format = "html",
               col.names = c("Year", 
                             "Sex", 
                             "State", 
                             "Allen", 
                             "Allen Proportion in State", 
                             "Alan", 
                             "Alan Proportion in State", 
                             "Allan", 
                             "Allan Proportion in State", 
                             "Total", 
                             "Overall State Proportion in Year"),
               caption = "Summary of Frequency and Proportion of 'Allen', 'Allan, and 'Allan' in Each State") |>
  kableExtra::kable_classic(html_font = "Arial") |>
  kableExtra::kable_styling(bootstrap_options = "striped")
Summary of Frequency and Proportion of 'Allen', 'Allan, and 'Allan' in Each State
Year Sex State Allen Allen Proportion in State Alan Alan Proportion in State Allan Allan Proportion in State Total Overall State Proportion in Year
2000 M AL 27 0.5294118 24 0.4705882 0 0.0000000 51 0.0122980
2000 M AR 12 0.4000000 18 0.6000000 0 0.0000000 30 0.0072341
2000 M AZ 22 0.1864407 85 0.7203390 11 0.0932203 118 0.0284543
2000 M CA 176 0.1986456 579 0.6534989 131 0.1478555 886 0.2136484
2000 M CO 14 0.2000000 51 0.7285714 5 0.0714286 70 0.0168797
2000 M CT 12 0.4615385 14 0.5384615 0 0.0000000 26 0.0062696
2000 M DC 0 0.0000000 6 1.0000000 0 0.0000000 6 0.0014468
2000 M FL 78 0.3750000 97 0.4663462 33 0.1586538 208 0.0501567
2000 M GA 42 0.3333333 79 0.6269841 5 0.0396825 126 0.0303834
2000 M HI 7 0.4666667 8 0.5333333 0 0.0000000 15 0.0036171
2000 M IA 9 0.4285714 12 0.5714286 0 0.0000000 21 0.0050639
2000 M ID 6 0.4285714 8 0.5714286 0 0.0000000 14 0.0033759
2000 M IL 60 0.3191489 114 0.6063830 14 0.0744681 188 0.0453340
2000 M IN 25 0.3289474 38 0.5000000 13 0.1710526 76 0.0183265
2000 M KS 10 0.4166667 14 0.5833333 0 0.0000000 24 0.0057873
2000 M KY 15 0.5357143 13 0.4642857 0 0.0000000 28 0.0067519
2000 M LA 30 0.5660377 18 0.3396226 5 0.0943396 53 0.0127803
2000 M MA 16 0.2711864 35 0.5932203 8 0.1355932 59 0.0142272
2000 M MD 15 0.3947368 23 0.6052632 0 0.0000000 38 0.0091633
2000 M ME 0 0.0000000 7 1.0000000 0 0.0000000 7 0.0016880
2000 M MI 64 0.4740741 55 0.4074074 16 0.1185185 135 0.0325537
2000 M MN 28 0.5283019 25 0.4716981 0 0.0000000 53 0.0127803
2000 M MO 26 0.4814815 20 0.3703704 8 0.1481481 54 0.0130215
2000 M MS 19 0.7037037 8 0.2962963 0 0.0000000 27 0.0065107
2000 M NC 56 0.3943662 70 0.4929577 16 0.1126761 142 0.0342416
2000 M NE 0 0.0000000 7 1.0000000 0 0.0000000 7 0.0016880
2000 M NH 0 0.0000000 5 1.0000000 0 0.0000000 5 0.0012057
2000 M NJ 28 0.2718447 66 0.6407767 9 0.0873786 103 0.0248372
2000 M NM 9 0.3103448 20 0.6896552 0 0.0000000 29 0.0069930
2000 M NV 10 0.2127660 29 0.6170213 8 0.1702128 47 0.0113335
2000 M NY 80 0.3149606 131 0.5157480 43 0.1692913 254 0.0612491
2000 M OH 79 0.5163399 64 0.4183007 10 0.0653595 153 0.0368941
2000 M OK 16 0.4705882 18 0.5294118 0 0.0000000 34 0.0081987
2000 M OR 14 0.4117647 20 0.5882353 0 0.0000000 34 0.0081987
2000 M PA 56 0.4705882 51 0.4285714 12 0.1008403 119 0.0286954
2000 M RI 0 0.0000000 6 1.0000000 0 0.0000000 6 0.0014468
2000 M SC 22 0.5945946 15 0.4054054 0 0.0000000 37 0.0089221
2000 M TN 37 0.4933333 33 0.4400000 5 0.0666667 75 0.0180854
2000 M TX 105 0.1962617 381 0.7121495 49 0.0915888 535 0.1290089
2000 M UT 12 0.4444444 15 0.5555556 0 0.0000000 27 0.0065107
2000 M VA 36 0.4000000 42 0.4666667 12 0.1333333 90 0.0217024
2000 M WA 27 0.3375000 43 0.5375000 10 0.1250000 80 0.0192911
2000 M WI 19 0.3333333 26 0.4561404 12 0.2105263 57 0.0137449
Code
allan_ca_pa <- allan |>
  filter(State %in% c("CA", "PA")) |>
  print()
# A tibble: 2 × 11
   Year Sex   State Allen AllenProp  Alan AlanProp Allan AllanProp Total Overa…¹
  <dbl> <chr> <chr> <dbl>     <dbl> <dbl>    <dbl> <dbl>     <dbl> <dbl>   <dbl>
1  2000 M     CA      176     0.199   579    0.653   131     0.148   886  0.214 
2  2000 M     PA       56     0.471    51    0.429    12     0.101   119  0.0287
# … with abbreviated variable name ¹​OverallProp
Code
allan_ca_pa |>
  knitr::kable(format = "html",
               col.names = c("Year", 
                             "Sex", 
                             "State", 
                             "Allen", 
                             "Allen Proportion in State", 
                             "Alan", 
                             "Alan Proportion in State", 
                             "Allan", 
                             "Allan Proportion in State", 
                             "Total", 
                             "Overall State Proportion in Year"),
               caption = "Summary of Frequency and Proportion of 'Allen', 'Allan, and 'Allan' in California and Pennsylvania",
               digits = 3) |>
  kableExtra::kable_classic(html_font = "Cambria") |>
  kableExtra::kable_styling(bootstrap_options = "striped", 
                            position = "left")
Summary of Frequency and Proportion of 'Allen', 'Allan, and 'Allan' in California and Pennsylvania
Year Sex State Allen Allen Proportion in State Alan Alan Proportion in State Allan Allan Proportion in State Total Overall State Proportion in Year
2000 M CA 176 0.199 579 0.653 131 0.148 886 0.214
2000 M PA 56 0.471 51 0.429 12 0.101 119 0.029